Titanic dataset analysis

Build a predictive model that answers the question "what sorts of people were more likely to survive?" using passenger data (i.e. name, age, gender, socio-economic class, etc.).


In [88]:
#Toggle code cells in the notebook below
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 document.getElementById('btn_toggle').value="Show Code";
 } else {
 $('div.input').show();
 document.getElementById('btn_toggle').value="Hide Code";
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input style = "float:right" type="submit" id="btn_toggle"></form>''')
Out[88]:
In [89]:
#Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from wordcloud import WordCloud
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from IPython.display import Image

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
In [90]:
# Setting up visualisations
sns.set_style(style='white') 
sns.set(rc={
    'figure.figsize':(12,7), 
    'axes.facecolor': 'white',
    'axes.grid': True, 
    'grid.color': '.9',
    'axes.linewidth': 1.0,
    'grid.linestyle': u'-'},
    font_scale=1.5)
custom_colors = ["#3498db", "#95a5a6","#34495e", "#2ecc71", "#e74c3c"]
sns.set_palette(custom_colors)
In [91]:
Image("Titanic_3.jpg", width = 700, height=50)
Out[91]:

Dataset description

In [92]:
Image("feature_desc.png")
Out[92]:
In [93]:
df_original = pd.read_excel('titanic3.xls')
print ("Loaded the dataset.")
print ("Sample view of the dataframe")
#sample view for dataframe
df_original.head(2)
Loaded the dataset.
Sample view of the dataframe
Out[93]:
pclass survived name sex age sibsp parch ticket fare cabin embarked boat body home.dest
0 1 1 Allen, Miss. Elisabeth Walton female 29.0000 0 0 24160 211.3375 B5 S 2 NaN St Louis, MO
1 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 113781 151.5500 C22 C26 S 11 NaN Montreal, PQ / Chesterville, ON
In [94]:
#Dataframe size
print (f'Shape of the dataframe {df_original.shape}')
Shape of the dataframe (1309, 14)


There are 1309 records and 14 columns in the original dataset.
Each record holds the details of one passenger on the ship.

Column names from the dataset.

In [95]:
#Print column names
df_original.columns
Out[95]:
Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
       'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'],
      dtype='object')
In [96]:
#Create a dummy dataframe for cleaning operations
#This will be the dataframe used for further operations
df_input = df_original.copy()
In [97]:
print ("Null value counts in the dataframe \n")
df_input.info();
Null value counts in the dataframe 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pclass     1309 non-null   int64  
 1   survived   1309 non-null   int64  
 2   name       1309 non-null   object 
 3   sex        1309 non-null   object 
 4   age        1046 non-null   float64
 5   sibsp      1309 non-null   int64  
 6   parch      1309 non-null   int64  
 7   ticket     1309 non-null   object 
 8   fare       1308 non-null   float64
 9   cabin      295 non-null    object 
 10  embarked   1307 non-null   object 
 11  boat       486 non-null    object 
 12  body       121 non-null    float64
 13  home.dest  745 non-null    object 
dtypes: float64(3), int64(4), object(7)
memory usage: 143.3+ KB


The dataset contains 4 integer columns, 7 string columns and 3 decimal columns.

In [98]:
#Missing values 
sns.heatmap(df_input.isnull(), cbar=False).set_title("Missing values heatmap");
In [99]:
#Missing value counts in the dataset
pd.DataFrame(df_input.isnull().sum()).plot.line().set_title("Number of missing values in the features");
print (df_input.isnull().sum())
pclass          0
survived        0
name            0
sex             0
age           263
sibsp           0
parch           0
ticket          0
fare            1
cabin        1014
embarked        2
boat          823
body         1188
home.dest     564
dtype: int64


The columns age, fare, cabin, embarked, boat, body and home.dest contain missing values.
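For reference, the same list can be derived programmatically rather than read off the counts; a minimal sketch on a toy frame (the column values here are illustrative, not taken from the dataset):

```python
import pandas as pd
import numpy as np

# Toy frame standing in for df_input; NaNs mark missing entries
toy = pd.DataFrame({
    "age": [29.0, np.nan, 2.0],
    "fare": [211.3, 151.5, np.nan],
    "sex": ["female", "male", "male"],
})

# Columns that contain at least one missing value
missing_cols = toy.columns[toy.isnull().any()].tolist()
print(missing_cols)  # → ['age', 'fare']
```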

In [100]:
#Display the number of unique values in the dataset
print ("Count of unique values in features:")
df_input.nunique()
Count of unique values in features:
Out[100]:
pclass          3
survived        2
name         1307
sex             2
age            98
sibsp           7
parch           8
ticket        939
fare          281
cabin         186
embarked        3
boat           28
body          121
home.dest     369
dtype: int64


The survived and sex columns contain 2 distinct values each.
pclass and embarked contain 3 distinct values each.

Features

In [101]:
(df_input.survived.value_counts(normalize=True)*100).plot.barh().set_title("Survived and Deceased people ratio");
print ("Non Survived and Survived value counts:")
df_input.survived.value_counts(normalize=True)*100
Non Survived and Survived value counts:
Out[101]:
0    61.802903
1    38.197097
Name: survived, dtype: float64
In [102]:
#plot the survival ratio
survival_cnt = np.unique(df_input.survived,return_counts = True)[1]
plt.pie(survival_cnt, labels=["Not Survived","Survived"],autopct="%.2f",colors=['red','green']);
plt.title("Ratio of survived vs Not survived");


About 62% of the passengers did not survive; only about 38% survived.

In [103]:
print (f"Unique values in pclass {df_input.pclass.unique()}")
print ((df_input.pclass.value_counts()))
Unique values in pclass [1 2 3]
3    709
1    323
2    277
Name: pclass, dtype: int64
In [104]:
fig_class = df_input.pclass.value_counts().plot.pie().legend(labels=["Class 3","Class 1","Class 2"], loc="center right", bbox_to_anchor = (2.25, 0.5)).set_title ("People travelling in different classes")


Pclass (passenger class) denotes the travelling class of the passenger. There were three classes, numbered 1, 2 and 3.

The majority of the passengers (709) travelled in 3rd class; 323 travelled in 1st class and 277 in 2nd class.

In [105]:
pclass1_survivor_distribution=round ((df_input[df_input.pclass==1].survived ==1).value_counts()[1] / len(df_input[df_input.pclass==1])*100,2)
pclass2_survivor_distribution=round ((df_input[df_input.pclass==2].survived ==1).value_counts()[1] / len(df_input[df_input.pclass==2])*100,2)
pclass3_survivor_distribution=round ((df_input[df_input.pclass==3].survived ==1).value_counts()[1] / len(df_input[df_input.pclass==3])*100,2)
pclass_perc_df=pd.DataFrame({"PercentageSurvived":{"Class1":pclass1_survivor_distribution,"Class2":pclass2_survivor_distribution,"Class3":pclass3_survivor_distribution},
                            "PercentageNotSurvived":{"Class1":100-pclass1_survivor_distribution,"Class2":100-pclass2_survivor_distribution,"Class3":100-pclass3_survivor_distribution} })
print (pclass_perc_df)
pclass_perc_df.plot.bar().set_title("Percentage of people survived on basis of class");
        PercentageSurvived  PercentageNotSurvived
Class1               61.92                  38.08
Class2               42.96                  57.04
Class3               25.53                  74.47
In [106]:
for x in [1,2,3]:
    df_input[df_input.pclass==x]["age"].plot(kind='kde')
plt.title("Age density in classes");
plt.legend(["1st","2nd","3rd"]);
In [107]:
for x in ["male","female"]:
    df_input[df_input["sex"]==x]['pclass'].plot(kind='kde')
plt.title("Gender density in classes");
In [108]:
(df_input.sex.value_counts(normalize=True)*100).plot.bar().set_title("Gender ratio");
print ("Value counts based on gender:")
print (df_input.sex.value_counts(normalize=True)*100)
Value counts based on gender:
male      64.400306
female    35.599694
Name: sex, dtype: float64


About 64% of the passengers are male and about 36% are female.

In [109]:
survived_male_count = df_input [(df_input["survived"] ==1) &(df_input["sex"] == "male")]["sex"].count()
survived_female_count=df_input [(df_input["survived"] ==1) &(df_input["sex"] == "female")]["sex"].count()
total_count = len(df_input["sex"] )
survived_male_perc = (survived_male_count/total_count)*100
survived_female_perc = (survived_female_count/total_count)*100

not_survived_male_perc=100-survived_male_perc
not_survived_female_perc=100-survived_female_perc
In [110]:
df_sex_survive=pd.DataFrame({"survived":{"male":survived_male_perc,"female":survived_female_perc}, "NotSurvived":{"male":not_survived_male_perc,"female":not_survived_female_perc}})
df_sex_survive.plot.barh().set_title("Survival ratio based on gender");
print ("Survival ratios based on gender: \n",df_sex_survive)
Survival ratios based on gender: 
          survived  NotSurvived
male    12.299465    87.700535
female  25.897632    74.102368


87% of male passengers did not survive.
74% of female passengers did not survive.

In [111]:
#Age information
print ("Age data information:")
df_input.age.describe()
Age data information:
Out[111]:
count    1046.000000
mean       29.881135
std        14.413500
min         0.166700
25%        21.000000
50%        28.000000
75%        39.000000
max        80.000000
Name: age, dtype: float64
In [112]:
df_input['Age_range']=pd.cut(df_input.age, [0,10,20,30,40,50,60,70,80,90,100])
sns.countplot(x="Age_range",data=df_input,hue="survived",palette=["C1","C0"]).set_title("Age range vs Survived");
plt.legend(labels=["Deceased","Survived"]);
In [113]:
sns.distplot(df_input.age);


Ages range from about 2 months to 80 years.
People in the 20-30 age range had more casualties than any other group.

In [114]:
print ("Sibsp data information :")
df_input.sibsp.describe()
Sibsp data information :
Out[114]:
count    1309.000000
mean        0.498854
std         1.041658
min         0.000000
25%         0.000000
50%         0.000000
75%         1.000000
max         8.000000
Name: sibsp, dtype: float64


sibsp is the number of siblings or spouses a passenger had on board. At most, 8 siblings/spouses travelled along with a single passenger.

In [115]:
ss=pd.DataFrame()
ss['survived']=df_input.survived
ss['sibling_spouse']=pd.cut(df_input.sibsp,[0,1,2,3,4,5,6,7,8],include_lowest=True)
(ss.sibling_spouse.value_counts()).plot.bar();


parch is the number of parents/children each passenger was travelling with.
A maximum of 9 parents/children travelled along with a single passenger.
Most passengers travelled with at most 1 parent/child.
Passengers who travelled alone also had a low survival rate.
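The notebook later combines sibsp and parch into a family-size feature and an is_alone flag; a minimal sketch of that derivation on toy counts (the values here are illustrative):

```python
import pandas as pd

# Toy sibsp/parch counts (values illustrative)
toy = pd.DataFrame({"sibsp": [1, 0, 3], "parch": [2, 0, 1]})

# Family size = siblings/spouses + parents/children; alone if both are zero
toy["Family"] = toy.sibsp + toy.parch
toy["is_alone"] = toy.Family == 0
print(toy["Family"].tolist())    # → [3, 0, 4]
print(toy["is_alone"].tolist())  # → [False, True, False]
```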

In [116]:
print ("parch data information")
df_input.parch.describe()
parch data information
Out[116]:
count    1309.000000
mean        0.385027
std         0.865560
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max         9.000000
Name: parch, dtype: float64
In [117]:
pc=pd.DataFrame()
pc["parch_bins"]= pd.cut(df_input.parch,[0,1,2,3,4,5,6,7,8,9],include_lowest=True)
pc["survived"]=df_input.survived
In [118]:
print ("parch bins value counts")
print (pc.parch_bins.value_counts())
pc.parch_bins.value_counts().plot.bar().set_title("Parents/Children value counts");
parch bins value counts
(-0.001, 1.0]    1172
(1.0, 2.0]        113
(2.0, 3.0]          8
(4.0, 5.0]          6
(3.0, 4.0]          6
(8.0, 9.0]          2
(5.0, 6.0]          2
(7.0, 8.0]          0
(6.0, 7.0]          0
Name: parch_bins, dtype: int64
In [119]:
x=sns.countplot(data=pc,x="parch_bins", hue="survived",palette=["C1","C0"]).legend(labels=["Deceased","Survived"])
x.set_title("Survival based on number of parents/children")
In [120]:
df_input['Family']= df_input.parch + df_input.sibsp
df_input['is_alone'] = (df_input.Family == 0)
In [121]:
df_input[df_input['is_alone'] ==  True]["survived"].value_counts().plot.bar().set_title("Persons travelled alone vs survival rates");


As the 'ticket' feature does not provide any additional information, we can remove it from the dataset.


People who paid more had a higher chance of survival.

In [122]:
print ("Fare data information:")
df_input.fare.describe()
Fare data information:
Out[122]:
count    1308.000000
mean       33.295479
std        51.758668
min         0.000000
25%         7.895800
50%        14.454200
75%        31.275000
max       512.329200
Name: fare, dtype: float64
In [123]:
#Create bins for fare
df_input['Fare_Bins']=pd.cut(df_input.fare,bins=[0,7.9,14.45,31.27,512], labels=['Low','Mid','High_Mid','High'])
In [124]:
sns.countplot(x="Fare_Bins",data = df_input, hue="survived",palette=["C1","C0"]).legend(labels=['Deceased',"Survived"]).set_title("Fare bins value counts");


Embarked signifies where the traveller boarded: Southampton, Cherbourg or Queenstown.
Passengers who boarded at Cherbourg survived at a higher rate than those from the other ports.

In [125]:
print ("Unique Embarked values:")
df_input.embarked.unique()
Unique Embarked values:
Out[125]:
array(['S', 'C', nan, 'Q'], dtype=object)
In [126]:
x=sns.countplot(x="embarked" , data=df_input, hue="survived",palette=["C1","C0"]);
x.set_xticklabels(["Southampton","Cherbourg","Queenstown"]);
x.legend(labels=["Deceased","Survived"]);
x.set_title("Embarked vs Survival rates")
Out[126]:
Text(0.5, 1.0, 'Embarked vs Survival rates')
In [127]:
#Count deceased passengers whose bodies were not recovered
print ("Number of deceased passengers whose bodies were not recovered:")
df_input[df_input["survived"]==0]["body"].isnull().sum()
Number of deceased passengers whose bodies were not recovered:
Out[127]:
688

The bodies of 688 deceased passengers were never found/identified.

Data Imputation

In [128]:
df_input.embarked.mode()[0]
df_input.embarked.fillna(df_input.embarked.mode()[0],inplace=True)


The embarked feature has two missing values. Since most passengers boarded at Southampton, that is the most probable port for the missing entries, so we impute them with Southampton ('S').
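A minimal sketch of this mode-based imputation on a toy series (the values here are illustrative):

```python
import pandas as pd
import numpy as np

# Toy embarked column with two missing values (values illustrative)
embarked = pd.Series(["S", "C", np.nan, "S", "Q", np.nan, "S"])

# Fill missing entries with the most frequent value (the mode)
filled = embarked.fillna(embarked.mode()[0])
print(embarked.mode()[0])     # → S
print(filled.isnull().sum())  # → 0
```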

In [129]:
#Create a new field for salutation / title
df_input["Salutation"] = df_input["name"].apply(lambda name:name.split(",")[1].split(".")[0].strip())
In [130]:
wc=WordCloud(width=1000, height=450,background_color='white').generate(str(df_input.Salutation.values))
plt.imshow(wc,interpolation='bilinear')
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()
print ("Salutation value counts:")
df_input.Salutation.value_counts()
Salutation value counts:
Out[130]:
Mr              757
Miss            260
Mrs             197
Master           61
Dr                8
Rev               8
Col               4
Ms                2
Major             2
Mlle              2
the Countess      1
Jonkheer          1
Sir               1
Lady              1
Dona              1
Mme               1
Don               1
Capt              1
Name: Salutation, dtype: int64
In [131]:
grp=df_input.groupby(['sex','Salutation'])
In [132]:
df_input.age=grp.age.apply(lambda x: x.fillna(x.median()))

The age feature has missing values.
We group by the sex and salutation features and fill each missing age with the median of its group.
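A minimal sketch of this group-median fill on toy data (the salutations and ages here are illustrative); `transform` is used in place of the notebook's `apply`, which yields the same fill for this case:

```python
import pandas as pd
import numpy as np

# Toy data: one missing age per salutation group (values illustrative)
toy = pd.DataFrame({
    "Salutation": ["Mr", "Mr", "Mr", "Miss", "Miss", "Miss"],
    "age": [30.0, 40.0, np.nan, 20.0, 22.0, np.nan],
})

# groupby + transform fills each NaN with its own group's median
toy["age"] = toy.groupby("Salutation")["age"].transform(
    lambda s: s.fillna(s.median())
)
print(toy["age"].tolist())  # → [30.0, 40.0, 35.0, 20.0, 22.0, 21.0]
```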

In [133]:
sal_df= pd.DataFrame(
{
    "Survived": df_input[df_input["survived"]==1].Salutation.value_counts(),
    "Total": df_input.Salutation.value_counts(),
})
s = sal_df.plot.barh().set_title("Survival ratio based on Salutation");


People with the salutation 'Mr' had the most casualties.

In [134]:
#Filling missing values for cabin with 'NA'
df_input.cabin.fillna('NA', inplace=True)


Filling the missing values for Cabin with 'NA'

Encoding and dropping columns

Using pandas 'get_dummies', we encode the categorical features.
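A minimal sketch of what `get_dummies` produces for a single categorical column (the values here are illustrative):

```python
import pandas as pd

# Toy categorical column (values illustrative)
toy = pd.DataFrame({"embarked": ["S", "C", "Q", "S"]})

# One indicator column per category; 1 marks membership
dummies = pd.get_dummies(toy["embarked"], prefix="Emb")
print(list(dummies.columns))   # → ['Emb_C', 'Emb_Q', 'Emb_S']
print(dummies.sum().tolist())  # → [1, 1, 2]
```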

In [135]:
df_input = pd.concat([df_input,pd.get_dummies(df_input["cabin"],prefix="cabin")], axis=1)
In [136]:
df_input = pd.concat([df_input,pd.get_dummies(df_input["embarked"],prefix="Emb")], axis=1)
In [137]:
df_input = pd.concat([df_input,pd.get_dummies(df_input["Salutation"],prefix="Title")], axis=1)
In [138]:
df_input = pd.concat([df_input,pd.get_dummies(df_input["Fare_Bins"],prefix="Fare")], axis=1)
In [139]:
df_input = pd.concat([df_input,pd.get_dummies(df_input["pclass"],prefix="Class")], axis=1)
In [140]:
df_input['sex']=LabelEncoder().fit_transform(df_input['sex'])
In [141]:
df_input['is_alone']=LabelEncoder().fit_transform(df_input.is_alone)
In [142]:
df_input.drop(["pclass","boat","body","home.dest","fare","cabin","Fare_Bins","name","Salutation","ticket","embarked","Age_range","sibsp","parch","age"], axis=1, inplace=True)
In [143]:
#Print the complete column names
#df_input.info(verbose=True)
In [144]:
#Check for any null values
#df_input.isnull().sum().any()
In [145]:
#dependent variable
y = df_input.survived
#independent variables
x=df_input.drop(["survived"],axis=1)
In [146]:
#train test split
x_train,x_test,y_train,y_test= train_test_split(x,y,test_size=0.2, random_state=1)
print (f'Training data: features shape {x_train.shape} target shape {y_train.shape}' )
print (f'Testing data: features shape {x_test.shape} target shape {y_test.shape}' )
Training data: features shape (1047, 218) target shape (1047,)
Testing data: features shape (262, 218) target shape (262,)
In [147]:
#Initiate classifier
rfc=RandomForestClassifier()
In [148]:
#Train the model
rfc.fit(x_train,y_train)
Out[148]:
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
In [149]:
#predict using the model
y_predict=rfc.predict(x_test)
In [150]:
accuracy=accuracy_score(y_predict,y_test)
print (f'Accuracy score {accuracy}')
Accuracy score 0.8053435114503816
In [151]:
plt.figure(figsize=(10,7))
sns.heatmap(confusion_matrix(y_test,y_predict), annot=True, cmap="summer", fmt='3.0f');
plt.title("Confusion matrix for Random Forest")
confusion_matrix(y_test,y_predict)
Out[151]:
array([[141,  15],
       [ 36,  70]], dtype=int64)
In [152]:
#Comparison result to csv file
compare_df=pd.DataFrame({"Actual":y_test,"Predicted":y_predict})
compare_df.to_csv("compare.csv")

Let us check how other models perform on this data.
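For context, `cross_val_score` splits the data into folds (5 by default in recent scikit-learn), trains on all but one fold and scores on the held-out fold, repeating once per fold; a minimal sketch on synthetic data (the `make_classification` parameters are illustrative):

```python
# Sketch of k-fold cross-validation scoring on a synthetic dataset
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Default cv yields one accuracy score per fold (5 folds)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y)
print(len(scores))  # → 5
```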

In [153]:
score_data = {
    "LogisticRegression": cross_val_score(LogisticRegression(), x,y),
    "SVM":cross_val_score(SVC(),x,y),
    "RandomForestClassifier": cross_val_score(RandomForestClassifier(),x,y),
    "DecisionTree": cross_val_score(DecisionTreeClassifier(),x,y),
    "XGradientBoosting": cross_val_score(XGBClassifier(),x,y),
    "KNN": cross_val_score(KNeighborsClassifier(),x,y)
}
df_score_summary = pd.DataFrame(data=score_data)
df_score_summary
Out[153]:
LogisticRegression SVM RandomForestClassifier DecisionTree XGradientBoosting KNN
0 0.515267 0.511450 0.511450 0.515267 0.515267 0.511450
1 0.793893 0.835878 0.778626 0.797710 0.770992 0.580153
2 0.564885 0.839695 0.740458 0.694656 0.729008 0.645038
3 0.721374 0.732824 0.702290 0.706107 0.690840 0.690840
4 0.720307 0.624521 0.628352 0.647510 0.639847 0.655172
In [154]:
score_mean = df_score_summary.mean()
In [155]:
score_mean.sort_values(ascending=False)
Out[155]:
SVM                       0.708874
DecisionTree              0.672250
RandomForestClassifier    0.672235
XGradientBoosting         0.669191
LogisticRegression        0.663145
KNN                       0.616531
dtype: float64


The Support Vector Machine performs better than the other models.

In [156]:
print (x_train.shape,x_test.shape,y_train.shape,y_test.shape)
(1047, 218) (262, 218) (1047,) (262,)
In [157]:
svc=SVC()
In [158]:
svc.fit(x_train,y_train)
Out[158]:
SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
In [159]:
y_svc_pred = svc.predict(x_test)
In [160]:
accuracy_svc=accuracy_score(y_svc_pred,y_test)
print (accuracy_svc)
0.8091603053435115
In [161]:
plt.figure(figsize=(10,7))
sns.heatmap(confusion_matrix(y_test, y_svc_pred), annot=True,cmap="summer", fmt='3.0f');
plt.title("Confusion matrix for Support Vector Machine");
In [ ]: